Rank based statistics in analyzing high-throughput genomic data
Authors
Abstract
Acknowledgments
I would like to thank my advisors, Dr. Zohar Yakhini of Agilent Labs and the Technion and Prof. Benny Chor of Tel Aviv University, for their guidance and support. I would especially like to thank Zohar for giving me the opportunity to work in the exciting field of computational biology, for his dedicated attention and inexhaustible patience, for the numerous captivating discussions on math, computer science, biology and their integration, and for teaching me that doing science is not confined to the office or the lab. Special thanks go to my fellow lab member Israel Steinfeld for his friendship, for being an important part of all the studies described in this thesis, and for numerous fruitful discussions. During my studies I found myself battling cancer, and not only scientifically. I would like to thank my colleagues, advisors, Agilent Technologies Israel, friends and family for being there for me when I needed them the most. Last but not least, I would like to thank my parents, Ruth and Gil Navon, for raising me to be curious and for showing me the passion for science and learning.
Abstract
High-throughput methods (such as microarrays) have revolutionized experimental biology. New statistical methods that facilitate analysis of the vast amounts of data generated by these experiments have become a critical part of the general study workflow. Remaining gaps, and variants that address specific situations, are the subject of ongoing research and development efforts. Such methods have indeed been developed and optimized to address various aspects of high-throughput genomic data analysis. This thesis focuses on specific aspects of expression data analysis. Chapter 2 describes GOrilla, a GO enrichment tool that provides statistically sound methods for identifying GO terms enriched at the top of a ranked list of genes. Chapter 3 addresses semi-supervised class discovery: finding meaningful partitions in the data based on genomic and phenotypic data. The third method, RCoS, presented in Chapter 4, is a rank consistency score that is useful in analyzing matched data. The statistical methods presented in the thesis are not limited to the analysis of gene expression, but the examples used are based on gene (and miRNA) expression. The methods presented herein are currently in use by a growing number of bioinformaticians and data analysis professionals. For example, GOrilla had over 5,000 visits in the first year following the publication of the paper.
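To make "enriched at the top of a ranked list" concrete, here is a minimal Python sketch. The abstract does not spell out the statistic, so the sketch assumes a minimum-hypergeometric style score: it scans every cutoff of the ranked list and keeps the smallest hypergeometric tail probability. The function names `hypergeom_tail` and `min_hypergeom_score` are hypothetical and this is not presented as the tool's actual implementation. The minimum taken over cutoffs is not itself a p-value, because many cutoffs are tested, and the sketch applies no correction for that.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B annotated, n drawn)."""
    upper = min(n, B)
    favourable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favourable / comb(N, n)


def min_hypergeom_score(labels):
    """Scan all cutoffs of a ranked 0/1 list (1 = gene carries the term).

    Returns (smallest tail probability, cutoff at which it occurs).
    Assumption: this is an illustrative score, not a corrected p-value.
    """
    N = len(labels)
    B = sum(labels)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for n, label in enumerate(labels, start=1):
        hits += label  # annotated genes among the top n
        score = hypergeom_tail(N, B, n, hits)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


# Toy ranked list: all three annotated genes sit at the very top.
ranked_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
score, cutoff = min_hypergeom_score(ranked_labels)
print(score, cutoff)
```

For this toy list the minimum is reached at cutoff 3, where the tail probability is 1/120: there is exactly one way, out of 120, to place all three annotated genes in the top three positions.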
Similar resources
A flexible rank-based framework for detecting copy number aberrations from array data
MOTIVATION DNA copy number aberration--both inherited and sporadic--is a significant contributor to a variety of human diseases. Copy number characterization is therefore an area of intense research. Probe hybridization-based arrays are important tools used to measure copy number in a high-throughput manner. RESULTS In this article, we present a simple but powerful nonparametric rank-based ap...
Hybrid Bayesian-rank integration approach improves the predictive power of genomic dataset aggregation
MOTIVATION Modern molecular technologies allow the collection of large amounts of high-throughput data on the functional attributes of genes. Often multiple technologies and study designs are used to address the same biological question such as which genes are overexpressed in a specific disease state. Consequently, there is considerable interest in methods that can integrate across datasets to...
ISRNA: an integrative online toolkit for short reads from high-throughput sequencing data
UNLABELLED Integrative Short Reads NAvigator (ISRNA) is an online toolkit for analyzing high-throughput small RNA sequencing data. Besides the high-speed genome mapping function, ISRNA provides statistics for genomic location, length distribution and nucleotide composition bias analysis of sequence reads. Number of reads mapped to known microRNAs and other classes of short non-coding RNAs, cove...
Savant: genome browser for high-throughput sequencing data
MOTIVATION The advent of high-throughput sequencing (HTS) technologies has made it affordable to sequence many individuals' genomes. Simultaneously the computational analysis of the large volumes of data generated by the new sequencing machines remains a challenge. While a plethora of tools are available to map the resulting reads to a reference genome, and to conduct primary analysis of the ma...
Detecting and Analyzing Genomic Structural Variation Using Distributed Computing
Genomic structural variations are an important class of genetic variants with a wide variety of functional impacts. The detection of structural variations using high-throughput short-read sequencing data is a difficult problem, and published algorithms do not provide the sensitivity and specificity required in research and clinical settings. Meanwhile, high-throughput sequencing is rapidly gene...